What is the R package ggplot2?

The R package ggplot2 is dedicated to data visualization in R. It can greatly improve the quality and aesthetics of your graphics, and will make you much more efficient in creating them.

The author of the package, Hadley Wickham, received the COPSS Presidents’ Award, the most prestigious award for young statisticians, in 2019. He also created other widely used packages, including the tidyverse collection, of which ggplot2 is a part.

2019 COPSS Presidents’ Awardee: Hadley Wickham

PART I - Creating your plot

# load the ggplot2 package in R
library(ggplot2)
Building Layers using ggplot2

ggplot2 builds charts through layers using geom_ functions, including geom_point, geom_line, geom_bar, geom_boxplot, geom_smooth, geom_tile, geom_violin, geom_hline, geom_vline, geom_histogram, geom_sf, geom_contour, geom_density, geom_hex, geom_jitter, geom_map, geom_area, geom_path, geom_segment, geom_qq, …

A World of GEOM

Data

# Print the first rows of the data set "mpg"
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

Basics

In order to create a plot, you:

  • Call the ggplot() function on your data which creates a blank canvas
  • Specify aesthetic mappings, which control how you want to map variables to visual aspects
  • Add new layers that are geometric objects which will show up on the plot
# create canvas
ggplot(data = mpg)

# map variables of interest
ggplot(mpg, aes(x = displ, y = hwy))

# add geom object
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

When we add the geom layer we use the addition (+) operator. As you add new layers you will always use + to add onto your visualization.
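
Because + returns a new plot object, a plot can also be stored in a variable and built up step by step. The following is a minimal sketch of that idea; the variable names (p, p_points, p_both) are arbitrary choices for illustration, and the linear smoother is just one possible second layer. Note that the + must come at the end of a line, not at the start of the next one, otherwise R treats the first line as a complete expression.

```r
library(ggplot2)

# store the canvas and mappings in a variable
p <- ggplot(mpg, aes(x = displ, y = hwy))

# add layers one at a time; each + returns a new plot object
p_points <- p + geom_point()
p_both   <- p_points + geom_smooth(method = "lm", formula = y ~ x)

# typing the name of a plot object (or calling print() on it) draws it
p_both
```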

Aesthetic mappings

The aesthetic mappings take properties of the data and use them to influence visual characteristics, such as position, colour, size, shape, or transparency. Each visual characteristic can thus encode an aspect of the data and be used to convey information.

All aesthetics for a plot are specified in the aes() function call (later in this tutorial you will see that each geom layer can have its own aes specification).

# add a mapping from the class of the cars to a colour characteristic
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()
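
Several visual characteristics can be mapped in the same aes() call. The sketch below is one possible combination, chosen only for illustration: drive type is mapped to both colour and shape, and city mileage to size, while transparency is fixed for all points.

```r
library(ggplot2)

# map drive type to colour and shape, and city mileage to size
p_multi <- ggplot(mpg, aes(x = displ, y = hwy,
                           colour = drv, shape = drv, size = cty)) +
  geom_point(alpha = 0.5)  # alpha is set (not mapped), so it applies to every point

p_multi
```

ggplot2 adds a legend for each mapped aesthetic, merging legends that share the same variable.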

Note that ‘aesthetics’ in ggplot refers to WHAT is plotted, not HOW it is plotted, unlike the word’s usual meaning. Using the aes() function will cause the visual to be based on the data specified in the argument. For example, using aes(colour = "blue") won’t cause the geometry’s colour to be ‘blue’, but will instead cause the visual to be mapped from the vector c("blue"), as if we only had a single class of car that happened to be called ‘blue’. If you wish to apply an aesthetic property to an entire geometry, you can set that property as an argument to the geom function, OUTSIDE of the aes() call:

# illustrate the common mistake of trying to specify a colour within the aes() function
ggplot(mpg, aes(x = displ, y = hwy, colour = "blue")) +
  geom_point()

# here is the correct way
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")
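
One way to see the difference without looking at the plots is to inspect the values ggplot2 computes for each layer. This sketch uses layer_data() from ggplot2; the names p_mapped and p_set are chosen here for illustration.

```r
library(ggplot2)

p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, colour = "blue")) +
  geom_point()
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")

# layer_data() returns the data frame a layer is drawn from
unique(layer_data(p_mapped)$colour)  # a colour chosen by the default scale, not "blue"
unique(layer_data(p_set)$colour)     # the literal colour that was set
```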

Exercise:

  1. Make a scatterplot of the relationship between the city and highway mileage of each car. Colour all the points red.
# YOUR CODE HERE
ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point(colour = "red")

  2. Now, colour the points in your scatterplot by the number of cylinders each car has.
# YOUR CODE HERE
ggplot(mpg, aes(x = cty, y = hwy, colour = cyl)) + 
  geom_point()

Specifying geometries

Building on these basics, ggplot2 can be used to build almost any kind of plot you may want. These plots are declared using functions.

The most obvious distinction between plots is what geometric objects (geoms) they include. ggplot2 supports a number of different types of geoms, including:

  • geom_point()
  • geom_line()
  • geom_smooth()
  • geom_bar()
  • geom_boxplot()
  • geom_histogram()
  • geom_polygon()
  • geom_map()

Each of these geometries will leverage the aesthetic mappings supplied. For example, you can map data to the location, colour, and shape of a geom_point (e.g., points can be circles or squares), or you can map data to the linetype of a geom_line (e.g., solid or dotted).

Most geoms require an x and y mapping as a bare minimum.

# x and y mapping needed for geom_point and geom_smooth
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# no y mapping needed for geom_bar and geom_histogram
ggplot(data = mpg, aes(x = class)) +
  geom_bar()

ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
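
The message above suggests choosing a bin width yourself. The sketch below does that, and also shows geom_col(), which can be used when the bar heights are already computed and therefore does need a y mapping. The bin width of 2 and the helper data frame class_counts are illustrative choices, not part of the original example.

```r
library(ggplot2)

# set the bin width explicitly (2 mpg per bin) instead of relying on 30 bins
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# geom_bar() counts rows for you; if the counts are already computed,
# geom_col() plots them as given and therefore needs a y mapping
class_counts <- as.data.frame(table(class = mpg$class))  # columns: class, Freq
ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col()
```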

Exercise:

  1. Create a box-and-whisker plot of the engine displacement of the different drive types (front, rear, or 4wd).
# YOUR CODE HERE
ggplot(mpg, aes(x = drv, y = displ)) + 
  geom_boxplot()

  2. Create a box-and-whisker plot of the engine displacement of the different engine types (cylinders). What happens if your x variable is numeric? How might we fix that?
# YOUR CODE HERE
ggplot(mpg, aes(x = cyl, y = displ)) + 
  geom_boxplot()

ggplot(mpg, aes(x = cyl, y = displ, group = cyl)) + 
  geom_boxplot()

ggplot(mpg, aes(x = factor(cyl), y = displ)) + 
  geom_boxplot()

Layering geometries

What makes this approach really powerful is that you can add multiple geometries to a plot, allowing you to create complex graphics showing multiple aspects of your data.

# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Of course, the aesthetics for each geom can be different, so you could show multiple lines on the same plot (or with different colours, styles, etc). It’s also possible to give each geom a different data argument, so that you can show multiple data sets in the same plot.

For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colours within each geom:

# same as above, but points blue and line red
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue") +
  geom_smooth(colour = "red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

If we specify an aesthetic within ggplot(), it is passed on to each geom that follows. Alternatively, we can specify an aesthetic within a single geom, so that it applies only to that layer (e.g. geom_point).

# colour aesthetic passed to each geom layer
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  geom_smooth(se = FALSE)

# colour aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE)
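
As mentioned earlier in this section, a geom can also be given its own data argument. A minimal sketch, assuming we want to highlight the two-seater cars on top of the full scatterplot (the subset, colours, and point size are illustrative choices):

```r
library(ggplot2)

# a second data set: only the two-seater cars
two_seaters <- subset(mpg, class == "2seater")

p_highlight <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "grey60") +                            # all cars, from the main data
  geom_point(data = two_seaters, colour = "red", size = 3)   # this layer uses its own data

p_highlight
```

The second layer inherits the x and y mappings from ggplot(), so the extra data frame only needs to contain columns with the same names.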

Exercise:

  1. Take your boxplot from the previous exercise and add a layer that displays the values of the individual points on each box-and-whisker plot.
# YOUR CODE HERE
ggplot(mpg, aes(x = factor(cyl), y = displ)) + 
  geom_boxplot() + 
  geom_point()

Position adjustments

In addition to a default statistical transformation, each geom also has a default position adjustment which specifies how different components should be positioned relative to each other. This position is noticeable in a geom_bar if you map a different variable to the fill colour:

# bar chart of class, coloured by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar()

The geom_bar by default uses a position adjustment of ‘stack’, which makes each rectangle’s height proportional to its value and stacks them on top of each other. We can use the position argument to specify what position adjustment rules to follow:

# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "dodge")

# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill")

Check the documentation for each particular geom to learn more about its positioning adjustments.
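
The strings passed to position are shorthand for position functions such as position_dodge(), which accept further arguments. As a sketch, position_dodge(preserve = "single") keeps every bar the same width even when a class does not contain all three drive types; whether that looks better is a matter of taste.

```r
library(ggplot2)

# dodge the bars, but keep each bar the same width in every class
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge(preserve = "single"))

p_dodge
```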

Exercise:

  1. Take your box-and-whisker plot from the previous exercise and use position = "jitter" in the point geometry layer. What happens?
# YOUR CODE HERE
ggplot(mpg, aes(x = factor(cyl), y = displ)) + 
  geom_boxplot() + 
  geom_point(position = "jitter")

ggplot(mpg, aes(x = factor(cyl), y = displ)) + 
  geom_boxplot() + 
  geom_jitter(width = 0.2)

  2. Create a histogram of the distribution of highway mileage values in mpg. Colour the bars by the drive type. Experiment with the position argument in the histogram geometry. What happens?
# YOUR CODE HERE
ggplot(mpg, aes(x = hwy, fill = drv)) + 
  geom_histogram(position = "dodge") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(x = hwy, fill = drv)) + 
  geom_histogram(position = "fill")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 15 rows containing missing values (`geom_bar()`).

ggplot(mpg, aes(x = hwy, fill = drv)) + 
  geom_histogram(position = "stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

PART II - Refining your plot

Managing scales

Whenever you specify an aesthetic mapping, ggplot uses a particular scale to determine the range of values that the data should map to. Thus when you specify

# colour the data by car class
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

ggplot automatically adds a scale for each mapping to the plot:

# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Each scale is represented by a function whose name follows the pattern scale_, then the name of the aesthetic property, then an _ and the type of scale (for example, scale_colour_discrete()). A continuous scale handles numeric data (where there is a continuous range of values), whereas a discrete scale handles categorical data such as the car class (where there is a small set of distinct values, each given its own colour).

While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, you can use a scale to change the direction of an axis:

# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()

Similarly, you can use scale_x_log10() and scale_x_sqrt() to transform your scale. You can also use scales to format your axes:

ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = .2), labels = scales::percent)
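
The transformation scales mentioned above can be combined with explicit limits and breaks. The sketch below log-transforms the x axis and fixes the y axis range; the particular limits and break spacing are illustrative values only.

```r
library(ggplot2)

p_log <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10() +                                 # log-transform the x axis
  scale_y_continuous(limits = c(0, 50),             # fix the y range
                     breaks = seq(0, 50, by = 10))  # one break every 10 mpg

p_log
```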

A common parameter to change is which set of colours to use in a plot. While you can use the default colouring, a more common option is to leverage the pre-defined palettes from ColorBrewer (colorbrewer2.org). These colour sets have been carefully designed to look good and to be viewable to people with certain forms of colour blindness. We can use ColorBrewer palettes by adding the scale_colour_brewer() function, passing the palette name as an argument.

# default colour brewer
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  scale_colour_brewer()

# specifying colour palette
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  scale_colour_brewer(palette = "Set3")
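
Following the scale naming pattern, bars and other filled shapes use the fill aesthetic, so the matching function is scale_fill_brewer() rather than scale_colour_brewer(). A short sketch (the palette "Set2" is an arbitrary choice):

```r
library(ggplot2)

# fill is a separate aesthetic from colour, so it has its own scale function
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")

p_fill
```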

Note that you can get the palette name from the ColorBrewer website by looking at the scheme query parameter in the URL. The help page ?scale_colour_brewer also lists the available palette names.

You can also specify continuous colour values by using a gradient scale, or manually specify the colours you want to use as a named vector.
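
Both options can be sketched as follows. The colours and the name drv_colours are illustrative choices; the names of the vector must match the values of the mapped variable (here the drive types "4", "f", and "r").

```r
library(ggplot2)

# continuous colour: a gradient between two colours for a numeric variable
p_gradient <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_gradient(low = "yellow", high = "red")

# manual colours: a named vector, one entry per value of the variable
drv_colours <- c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen")
p_manual <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = drv_colours)

p_gradient
p_manual
```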

Exercises

  1. Produce a scatter plot with displ on the x axis and hwy on the y axis, and colour by drv. Next, go to the ColorBrewer website (colorbrewer2.org) and find an appropriate colour palette that is colourblind safe. Update your plot with that palette.
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2")

  2. Your plot should currently have three breaks on the y axis at 20, 30, and 40. Using scale_y_continuous(), try increasing the number of breaks so that there are breaks at 15, 20, 25, 30, 35 and so on. Hint: Check the help file using ?scale_y_continuous.
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  scale_y_continuous(n.breaks = 6)

Facets

Facets are ways of splitting a plot into multiple pieces (subplots). This allows you to view a separate plot for each value of a categorical variable. You can construct a plot with multiple facets by using the facet_wrap() function. This will produce a “row” of subplots, one for each value of the categorical variable (the number of rows can be specified with an additional argument):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(. ~ class)
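
The additional argument for the number of rows is nrow (ncol sets the number of columns instead). A minimal sketch, with two rows chosen for illustration:

```r
library(ggplot2)

# arrange the subplots for the car classes in two rows
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

p_wrap
```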

You can also use facet_grid() to facet your data by more than one categorical variable. Note that we use a tilde (~) in our facet functions. With facet_grid() the variable to the left of the tilde will be represented in the rows and the variable to the right will be represented across the columns.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(year ~ cyl)

Exercises

  1. Produce a scatter plot with a smoothed line with displ on the x axis and hwy on the y axis. Facet the plot by drv using facet_grid().
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth() +
    facet_grid(drv ~ .)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  2. Now try setting scales = 'free' inside facet_grid(). What has changed?
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth() +
    facet_grid(drv ~ ., scales = 'free')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Labels & annotations

Textual labels and annotations (on the plot, axes, geometry, and legend) are an important part of making a plot understandable and communicating information. Although not an explicit part of the Grammar of Graphics (they would be considered a form of geometry), ggplot makes it easy to add such annotations.

You can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!):

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine power (litres displacement)",
       y = "Fuel Efficiency (miles per gallon)",
       colour = "Car Type")
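
For a one-off annotation on the plot itself, ggplot2 provides annotate(), which adds a geom at fixed coordinates instead of mapping it from the data. In this sketch the position (x = 6, y = 40) and the label text are placeholders chosen for illustration.

```r
library(ggplot2)

p_annotated <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # a single text label at fixed plot coordinates
  annotate("text", x = 6, y = 40, label = "An annotation", hjust = 1)

p_annotated
```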

Saving the plots

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine power (litres displacement)",
       y = "Fuel Efficiency (miles per gallon)",
       colour = "Car Type")

pdf("myplot.pdf",width=6,height=4)

print(p)

dev.off()
## quartz_off_screen 
##                 2
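
ggplot2 also offers ggsave(), which opens the device, prints the plot, and closes the device in one call. A minimal sketch, assuming the same width and height as above; the temporary output path is a placeholder, and the plot is rebuilt here only so that the example is self-contained.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# placeholder output path; the file extension determines the file type
out_file <- file.path(tempdir(), "myplot.pdf")

# width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)
```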

Further Resources

The R Graph Gallery showcases most, if not all, of what you can do with ggplot2: https://r-graph-gallery.com/.